A common strategy to deal with these issues is to build repeated models (weak learners) on the same data and combine them to form a single result.
These are called ensemble or consensus estimators/predictors.
As a general rule, ensemble learners tend to improve the results obtained with the weak learners they are made of.
Ensembles can be built from different types of learners, but we will focus on those built on trees:
Two questions arise here:
The bootstrap has been applied to almost any problem in Statistics.
We begin with the easiest and best known case: estimating the standard error (that is the square root of the variance) of an estimator.
\[\begin{eqnarray*} \theta &=& E_F(X)=\theta (F) \\ \theta &=& Med(X)=\{m:P_F(X\leq m)=1/2\}= \theta (F). \end{eqnarray*}\]
\[\begin{eqnarray*} \hat{\theta}&=&\overline{X}=\int XdF_n(x)=\frac 1n\sum_{i=1}^nx_i=\theta (F_n) \\ \hat{\theta}&=&\widehat{Med}(X)=\{m:\frac{\#x_i\leq m}n=1/2\}=\theta (F_n) \end{eqnarray*}\]
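The plug-in principle can be seen directly in R, where the sample mean and sample median are exactly \(\theta(F_n)\) (the sample values below are made up for illustration):

```r
# Plug-in principle: theta(F_n) is theta(F) with F replaced by the
# empirical distribution F_n, which puts mass 1/n on each observation
x <- c(3, 7, 1, 9, 5)   # a small illustrative sample
mean(x)                 # theta_hat = integral of x dF_n(x), the sample mean
median(x)               # theta_hat = Med(X) under F_n, the sample median
```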
An important question when computing an estimator \(\hat \theta\) of a parameter \(\theta\) is: how precise is \(\hat \theta\) as an estimator of \(\theta\)?
\[ \sigma _{\overline{X}}=\frac{\sigma (X)}{\sqrt{n}}=\frac{\sqrt{\int [x-\int x\,dF(x)]\sp 2dF(x)}}{\sqrt{n}}=\sigma _{\overline{X}}(F) \]
then, the standard error estimator is the same functional applied on \(F_n\), that is:
\[ \hat{\sigma}_{\overline{X}}=\frac{\hat{\sigma}(X)}{\sqrt{n}}=\frac{\sqrt{1/n\sum_{i=1}^n(x_i-\overline{x})^2}}{\sqrt{n}}=\sigma _{\overline{X}}(F_n). \]
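As a quick check, this plug-in standard error can be computed directly in R (the data below are made up; note the divisor \(n\) rather than \(n-1\) in the plug-in variance):

```r
# Plug-in estimate sigma_Xbar(F_n) of the standard error of the sample mean
x <- c(4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.8)   # illustrative sample
n <- length(x)
# sigma_hat(X): standard deviation with divisor n (the plug-in version),
# not the divisor n - 1 used by sd()
sigma_hat <- sqrt(mean((x - mean(x))^2))
se_hat <- sigma_hat / sqrt(n)
se_hat
```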
The bootstrap method makes it possible to do the desired approximation: \[\hat{\sigma}_{\hat\theta} \simeq \sigma _{\hat\theta}(F_n)\] without having to know the form of \(\sigma_{\hat\theta}(F)\).
To do this, the bootstrap estimates, or directly approximates, \(\sigma_{\hat{\theta}}(F_n)\) from the sample.
The bootstrap allows us to estimate the standard error from samples of \(F_n\); that is, the substitution of \(F\) by \(F_n\) is carried out in the sampling step.
\[\begin{eqnarray*} &&\mbox{Instead of: } \\ && \quad F\stackrel{s.r.s}{\longrightarrow }{\bf X} = (X_1,X_2,\dots, X_n) \, \quad (\hat \sigma_{\hat\theta} =\underbrace{\sigma_{\hat\theta}(F_n)}_{unknown}) \\ && \mbox{It is done: } \\ && \quad F_n\stackrel{s.r.s}{\longrightarrow }\quad {\bf X^{*}}=(X_1^{*},X_2^{*}, \dots ,X_n^{*}) \quad (\hat \sigma_{\hat\theta}= \hat \sigma_{\hat \theta}^* \simeq \sigma_{\hat \theta}^*). \end{eqnarray*}\]
Here, \(\sigma_{\hat \theta}^*\) is the bootstrap standard error of \(\hat \theta\) and
\(\hat \sigma_{\hat \theta}^*\) the bootstrap estimate of the standard error of \(\hat \theta\).
That is, the new (re-)sampling process consists of extracting samples of size \(n\) of \(F_n\):
\({\bf X^{*}}=(X_1^{*},X_2^{*},\dots ,X_n^{*})\) is a random sample of size \(n\) obtained with replacement from the original sample \((X_1,X_2,\dots ,X_n)\).
Samples \({\bf X^*}\), obtained through this procedure are called bootstrap samples or re-samples.
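In R, drawing one bootstrap resample is a single call to `sample()` with replacement (the sample values below are made up):

```r
# Drawing one bootstrap resample X* of size n from F_n amounts to
# sampling with replacement from the original sample
x <- c(2.1, 3.4, 1.8, 5.0, 4.2)                  # original sample
x_star <- sample(x, size = length(x), replace = TRUE)
x_star                                           # a bootstrap resample
```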
\[\begin{eqnarray*} \mathcal {L}(\hat \theta)&\simeq& P_F(\hat\theta \leq t): \mbox{Sampling distribution of } \hat \theta,\\ \mathcal {L}(\hat \theta^*)&\simeq& P_{F_n}(\hat\theta^* \leq t): \mbox{Bootstrap distribution of } \hat \theta, \end{eqnarray*}\]
This distribution is usually not known.
However, the sampling process and the calculation of the statistic can be approximated using a Monte Carlo algorithm.
\[ \mbox{if }B\rightarrow\infty \mbox{ then } \hat{\sigma}_B (\hat\theta) \rightarrow \hat\sigma_{\infty} (\hat\theta) =\sigma_B(\hat\theta)=\sigma_{\hat\theta}(F_n). \]
The bootstrap approximation, \(\hat{\sigma}_B(\hat\theta)\), to the bootstrap SE, \(\sigma_B(\hat\theta)\), provides an estimate of \(\sigma_{\hat\theta}(F_n)\):
\[ \hat{\sigma}_B(\hat\theta)(\simeq \sigma_B(\hat\theta)=\sigma_{\hat\theta}(F_n))\simeq\hat \sigma_{\hat\theta}(F_n). \]
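A minimal Monte Carlo sketch of this approximation, here for the median of a simulated sample (the sample, \(B\), and the seed are arbitrary choices for illustration):

```r
# Monte Carlo approximation hat sigma_B of the bootstrap standard error:
# draw B resamples from F_n, compute theta_hat* on each, take their SD
set.seed(123)
x <- rnorm(50, mean = 10, sd = 2)      # the "original" sample (simulated here)
B <- 2000
theta_star <- replicate(B, median(sample(x, replace = TRUE)))  # theta_hat*_b
sigma_B_hat <- sd(theta_star)          # hat sigma_B(theta_hat)
sigma_B_hat
```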
From real world to bootstrap world:
\[\hat f_{bag}(x)=\frac 1B \sum_{b=1}^B \hat f^{*b}(x) \]
\[ \hat G_{bag}(x) = \arg \max_k \hat f^k_{bag}(x). \]
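A hand-rolled sketch of the bagging average above, using rpart trees as weak learners on simulated data (the data and \(B\) are made up for illustration):

```r
# Bagging by hand: average B trees, each fit to a bootstrap resample
library(rpart)
set.seed(1)
n <- 200
df <- data.frame(x = runif(n, 0, 10))
df$y <- sin(df$x) + rnorm(n, sd = 0.3)           # toy regression data
B <- 25
preds <- sapply(seq_len(B), function(b) {
  idx  <- sample(n, replace = TRUE)              # bootstrap resample b
  tree <- rpart(y ~ x, data = df[idx, ])         # weak learner hat f^{*b}
  predict(tree, newdata = df)
})
f_bag <- rowMeans(preds)     # hat f_bag(x) = (1/B) * sum_b hat f^{*b}(x)
head(f_bag)
```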
Since each out-of-bag set is not used to train the model, it can be used to evaluate performance.
Source: https://www.baeldung.com/cs/random-forests-out-of-bag-error
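A minimal sketch of the OOB idea on simulated data: each tree is evaluated only on the observations left out of its bootstrap resample (all data and settings here are illustrative):

```r
# Out-of-bag (OOB) error by hand for a bagged ensemble of rpart trees
library(rpart)
set.seed(1)
n <- 200
df <- data.frame(x = runif(n, 0, 10))
df$y <- sin(df$x) + rnorm(n, sd = 0.3)
B <- 50
pred_sum <- numeric(n)
pred_cnt <- numeric(n)
for (b in seq_len(B)) {
  idx <- sample(n, replace = TRUE)               # bootstrap resample
  oob <- setdiff(seq_len(n), idx)                # out-of-bag observations
  tree <- rpart(y ~ x, data = df[idx, ])
  # accumulate predictions only on observations the tree never saw
  pred_sum[oob] <- pred_sum[oob] + predict(tree, newdata = df[oob, , drop = FALSE])
  pred_cnt[oob] <- pred_cnt[oob] + 1
}
has_oob  <- pred_cnt > 0                         # observations OOB at least once
oob_rmse <- sqrt(mean((df$y[has_oob] - pred_sum[has_oob] / pred_cnt[has_oob])^2))
oob_rmse
```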
This example relies on the well-known AmesHousing dataset on house prices in Ames, IA.
We use libraries:
rpart for stratified resampling
ipred for bagging.
A complementary way to interpret a tree is to quantify how important each feature is.
This is done by measuring the total reduction in the loss function associated with each variable across all splits.
This measure can be extended to an ensemble simply by adding up variable importance over all trees built.
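For a single rpart tree this total loss reduction is stored in the fitted object; an ensemble version simply sums these vectors over trees. A small illustration on the built-in iris data:

```r
# Variable importance from a single rpart tree: for each predictor, the
# total reduction in the fitting criterion across all of its splits
library(rpart)
fit <- rpart(Species ~ ., data = iris)
round(fit$variable.importance, 1)   # named vector; larger = more important
```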
The caret package or the vip function from the vip package can be used (see lab examples).
Random Forests Algorithm, from chapter 17 in (Hastie and Efron 2016).
library(ranger)
# number of features
n_features <- length(setdiff(names(ames_train), "Sale_Price"))
# train a default random forest model
ames_rf1 <- ranger(
  Sale_Price ~ .,
  data = ames_train,
  mtry = floor(n_features / 3),
  respect.unordered.factors = "order",
  seed = 123
)
# get OOB RMSE
(default_rmse <- sqrt(ames_rf1$prediction.error))
## [1] 24859.27
There are several parameters that, appropriately tuned, can improve RF performance.
1 and 2 tend to have the largest impact on predictive accuracy.
The idea of improving weak learners by aggregation has moved historically along two distinct lines:
Idea: create a model that is better than any of its individual components by combining their strengths and compensating for their weaknesses.
For this, multiple weak models are trained sequentially, and each new model is trained to correct the errors made by the previous ones.
The final model is a weighted combination of all the models, and the weights are determined by the accuracy of each model.
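As an illustration of this sequential reweighting scheme, here is a compact AdaBoost-style sketch with rpart stumps; the data, number of rounds, and settings are all made up for illustration, and this is not the lab code:

```r
# AdaBoost-style boosting with decision stumps; labels in {-1, +1}
library(rpart)
set.seed(7)
n <- 200
df <- data.frame(x1 = runif(n), x2 = runif(n))
df$y <- ifelse(df$x1 + df$x2 > 1, 1, -1)         # toy diagonal boundary
w <- rep(1, n)                                   # observation weights
M <- 10
alphas <- numeric(M)
stumps <- vector("list", M)
for (m in seq_len(M)) {
  stump <- rpart(factor(y) ~ x1 + x2, data = df, weights = w,
                 control = rpart.control(maxdepth = 1))
  pred <- as.numeric(as.character(predict(stump, df, type = "class")))
  err  <- sum(w * (pred != df$y)) / sum(w)       # weighted error rate
  alphas[m] <- 0.5 * log((1 - err) / err)        # accurate models weigh more
  w <- w * exp(alphas[m] * (pred != df$y))       # up-weight the mistakes
  w <- n * w / sum(w)                            # renormalize to sum n
  stumps[[m]] <- stump
}
# Final prediction: sign of the weighted vote of all stumps
score <- rowSums(sapply(seq_len(M), function(m)
  alphas[m] * as.numeric(as.character(predict(stumps[[m]], df, type = "class")))))
mean(sign(score) == df$y)                        # training accuracy
```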
Boosting, like other ensemble methods, improves the accuracy of weak learners and achieves better predictive performance than the individual models.
Boosting also reduces overfitting by improving the generalization ability of models.
Available in many flavors,
Can be parallelized
Extensive track record in real-world applications and industry.
Can be computationally expensive, especially when dealing with large datasets and complex models.
Can be sensitive to noisy data and outliers.
May not work well with certain types of data distributions.
Not so good as “out-of-the-box”: Requires careful tuning of hyperparameters to achieve optimal performance, which can be time-consuming and challenging.
The main steps in Gradient Boosting are:
1. Initialize the model with a constant prediction (e.g., the mean of the response).
2. Compute the pseudo-residuals, i.e., the negative gradient of the loss function at the current predictions.
3. Fit a weak learner (typically a shallow tree) to the pseudo-residuals and add it, scaled by a learning rate, to the current model.
4. Repeat steps 2-3 for a specified number of iterations or until convergence.
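These steps can be sketched for squared-error loss, where the pseudo-residuals are just the ordinary residuals, using shallow rpart trees as weak learners (all data and settings below are illustrative):

```r
# Minimal gradient boosting for squared-error loss with shallow rpart trees
library(rpart)
set.seed(42)
n <- 300
df <- data.frame(x = runif(n, 0, 10))
df$y <- sin(df$x) + rnorm(n, sd = 0.2)           # toy regression data
M  <- 100                                        # boosting iterations
nu <- 0.1                                        # learning rate
F_hat <- rep(mean(df$y), n)                      # step 1: constant initialization
for (m in seq_len(M)) {
  df$r <- df$y - F_hat                           # step 2: pseudo-residuals
  weak <- rpart(r ~ x, data = df,                # step 3: fit a weak learner...
                control = rpart.control(maxdepth = 2))
  F_hat <- F_hat + nu * predict(weak, newdata = df)  # ...and add it, scaled by nu
}
sqrt(mean((df$y - F_hat)^2))                     # training RMSE after boosting
```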
Multiple extensions of Gradient Boosting exist:
XGBoost
LightGBM
Fraud Detection
Image and Speech Recognition
Anomaly Detection
Medical Diagnosis
Amazon’s recommendation engine
Models that predict protein structures from amino acid sequences
Pattern identification in fMRI brain scans.